BUFFER MANAGEMENT FOR PARALLEL GRAPHICS PROCESSING UNIT
Patent Abstract:
Buffer management for a parallel graphics processing unit. The techniques generally relate to performing data processing operations in parallel and in a pipelined manner. A first execution stream is executed on a first unit (28A) of, for example, a GPU shading processor (26), and a second execution stream is executed in parallel on a second unit (28N). The data produced by the execution of the first execution stream is then consumed by the second unit, which executes the second execution stream. A management unit (18) within an IC (16) that includes the GPU receives a request from the first unit to store the data it produces in a store (22A-22N) in a global memory external to the IC, the store comprising a FIFO store, one example of which is an annular store, and determines a location where the data produced by the execution of the first execution stream will be stored. Upon receiving a request from the second unit to retrieve the data produced by the execution of the first execution stream, the management unit determines whether the data from the first execution stream is available for retrieval and consumption by the second execution stream.
Publication number: BR112014018434B1
Application number: R112014018434-8
Filing date: 2013-01-24
Publication date: 2021-07-27
Inventors: Alexei V. Bourd; Vineet Goel
Applicant: Qualcomm Incorporated
IPC main class:
Patent Description:
[0001] This application claims the benefit of U.S. Provisional Application No. 61/591,733, filed January 27, 2012, the entire contents of which are hereby incorporated by reference in their entirety.
TECHNICAL FIELD
[0002] This disclosure relates to memory access management and, more specifically, to memory access management in graphics processing units (GPUs).
BACKGROUND
[0003] Graphics processing units (GPUs) are being used for purposes other than graphics processing. For example, non-graphics applications can run faster by exploiting the massive parallelism of a GPU. This has led to GPUs that provide processing functionality unrelated to graphics and that are referred to as general-purpose GPUs (GPGPUs). For example, a GPGPU includes one or more shading cores, and the shading cores are configured to run both graphics-related applications and non-graphics-related applications.
SUMMARY
[0004] In general, this disclosure relates to techniques for managing, with the GPU, a store that resides in a global memory and that stores data for a graphics processing unit (GPU). For example, an integrated circuit (IC) chip that includes the GPU includes a pipeline management unit. The pipeline management unit can be configured to maintain state information for one or more stores in global memory. When an application running on the GPU needs to access the stores in global memory, the state information for the stores in global memory is available inside the IC chip. In this way, it is not necessary for the GPU to access off-chip memory in order to determine the state information of the stores in global memory.
[0005] In one example, the disclosure describes a method for performing data processing operations in a pipelined manner. The method includes executing a first execution stream on a first programmable computing unit of an operations-specific graphics processor of a graphics processing unit (GPU) and executing a second execution stream on a second programmable computing unit of the GPU's operations-specific graphics processor. The method also includes receiving, with a management unit within an integrated circuit (IC) that includes the GPU, a request from the first programmable computing unit to store data produced by executing the first execution stream in a store in a global memory external to the IC. In this example, the data produced by executing the first execution stream will be consumed by the second programmable computing unit that executes the second execution stream. Furthermore, in this example, the store comprises one of a first in, first out (FIFO) store and an annular store. The method also includes determining, with the management unit, the location within the store where the data produced by the execution of the first execution stream will be stored, and storing, with the IC, the data produced by the execution of the first execution stream at the determined location within the store.
[0006] In one example, the disclosure describes an apparatus. The apparatus includes a global memory that includes a store. In this example, the store comprises one of a first in, first out (FIFO) store and an annular store. The apparatus also includes an integrated circuit (IC) that includes a graphics processing unit (GPU) and a management unit. The GPU includes a first programmable computing unit configured to execute a first execution stream and a second programmable computing unit configured to execute a second execution stream.
The management unit is configured to receive a request from the first programmable computing unit to store the data produced by executing the first execution stream in the store in global memory. In this example, the data produced by executing the first execution stream will be consumed by the second programmable computing unit that executes the second execution stream. The management unit is also configured to determine the location within the store where the data produced by the execution of the first execution stream will be stored. In this example, the IC is configured to store the data produced by the execution of the first execution stream at the determined location within the store.
[0007] In one example, the disclosure describes an apparatus. The apparatus includes a global memory and an integrated circuit (IC). The global memory includes a store. In this example, the store comprises one of a first in, first out (FIFO) store and an annular store. The IC includes a graphics processing unit (GPU) that comprises a device for executing a first execution stream and a device for executing a second execution stream. The IC also includes a device for receiving, from the device for executing the first execution stream, a request to store the data produced by executing the first execution stream in the store in global memory. In this example, the data produced by the execution of the first execution stream will be consumed by the device for executing the second execution stream. The IC also includes a device for determining the location within the store where the data produced by the device for executing the first execution stream will be stored, and a device for storing the data produced by executing the first execution stream at the determined location within the store.
[0008] In one example, the disclosure describes a computer-readable storage medium having instructions stored on it that, when executed, cause one or more processors to execute a first execution stream on a first programmable computing unit of an operations-specific graphics processor of a graphics processing unit (GPU) and to execute a second execution stream on a second programmable computing unit of the GPU's operations-specific graphics processor. The instructions cause the one or more processors to receive, with a management unit within an integrated circuit (IC) that includes the GPU, a request from the first programmable computing unit to store the data produced by executing the first execution stream in a store in a global memory external to the IC. In this example, the data produced by executing the first execution stream will be consumed by the second programmable computing unit that executes the second execution stream. Furthermore, in this example, the store comprises one of a first in, first out (FIFO) store and an annular store. The instructions also cause the one or more processors to determine, with the management unit, the location within the store where the data produced by the execution of the first execution stream will be stored, and to store, with the IC, the data produced by the execution of the first execution stream at the determined location within the store.
[0009] Details of one or more examples are presented in the accompanying drawings and in the description that follows. Other features, objects, and advantages will become evident from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] Figure 1 is a block diagram showing an example of an apparatus according to one or more examples described in this disclosure.
[0011] Figure 2 is a block diagram showing a graphics processing unit (GPU) and a global memory in more detail.
[0012] Figure 3 is a flowchart showing an exemplary technique in accordance with one or more examples described in this disclosure.
[0013] Figure 4 is a flowchart showing another exemplary technique in accordance with one or more examples described in this disclosure.
[0014] Figure 5 is a block diagram showing the apparatus of Figure 1 in more detail.
DETAILED DESCRIPTION
[0015] A graphics processing unit (GPU) may include an operations-specific graphics processor that is configured to run one or more applications. Examples of these applications include shading programs such as vertex shaders, hull shaders, fragment shaders, and geometry shaders, as well as other applications related to graphics processing. In addition, some application developers may find it beneficial to exploit the massive parallelism of a GPU and run non-graphics-related applications on the GPU. For example, the processing parallelism provided by a GPU may be suitable for performing parallel matrix operations, even when the matrix operations are unrelated to graphics processing. Other examples of non-graphics applications include techniques related to fluid dynamics or linear algebra, where rapid execution of parallel operations can be beneficial. Non-graphics-related applications can also run on the operations-specific graphics processor.
[0016] A GPU that is capable of running such non-graphics-related applications can be considered a general-purpose GPU (GPGPU). For example, when a GPU is running non-graphics-related applications, the GPU is functioning as a GPGPU. Almost all GPUs can be configured to function as a GPGPU.
[0017] For purposes of illustration, this disclosure describes techniques with respect to a GPU that functions as a GPGPU. However, the techniques are not limited to instances where the GPU is functioning as a GPGPU (i.e., running non-graphics-related applications), and the techniques also apply to instances where the GPU is running graphics-related applications. Furthermore, the techniques described in this disclosure can be implemented by any type of processing unit, such as a central processing unit (CPU), an accelerator, or any other custom apparatus. Although the techniques are described with respect to a GPU, it should be understood that the techniques are extensible to other types of processing units.
[0018] The operations-specific graphics processor within the GPU may include a series of shading cores (also referred to as programmable computing units, to indicate that these cores can execute instructions for both graphics-related and non-graphics-related applications). Each of the programmable computing units may include local memory reserved for instructions to be executed by that programmable computing unit, as well as for data produced by the execution of those instructions, such as intermediate results produced during the execution of execution streams. The local memory of a programmable computing unit may be inaccessible by other programmable computing units. In some cases, different applications that will run on the GPU can be run by different programmable computing units.
[0019] In the techniques described in this disclosure, graphics-related applications are referred to as shaders, and non-graphics-related applications are referred to as cores.
Examples of shaders (i.e., graphics-related applications) include, but are not limited to, a vertex shader, a fragment shader, and a geometry shader. Examples of cores (i.e., non-graphics-related applications) include applications that perform matrix operations, fluid dynamics, image processing operations, video processing operations, and the like.
[0020] Furthermore, cores are not necessarily limited to applications that are run by the GPU; cores also include fixed-function units (i.e., non-programmable units) of the GPU. For purposes of example only, the techniques described in this disclosure are described with respect to cores that are applications running on the GPU. For example, the techniques are described with respect to non-graphics-related applications running on the operations-specific graphics processor of a GPU, so that the GPU functions as a GPGPU.
[0021] A core can include a series of workgroups, tasks, or execution streams (all of which are used synonymously in this disclosure). For example, an execution stream can be a set of core instructions that can be executed independently of the core's other execution streams. In some examples, to run a core, one or more of the programmable computing units can each run one or more execution streams of the core. For example, a first programmable computing unit can execute a first execution stream of the core, and a second programmable computing unit can execute a second execution stream of the same core. In some examples, one programmable computing unit may run one or more execution streams of one core, while another programmable computing unit runs one or more execution streams of another core. In some instances, a combination of the two may be possible (that is, some programmable computing units run different execution streams of the same core, while some other programmable computing units run execution streams of different cores).
[0022] In general, the GPU can be configured to implement a single program, multiple data (SPMD) programming model. In the SPMD programming model, the GPU can execute a core on several programmable computing units (as execution streams, for example), where each programmable computing unit performs functions on its own data. Furthermore, in the SPMD programming model, the programmable computing units include respective program counters that indicate the current instructions being executed by the programmable computing units.
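The SPMD model described above can be illustrated with a minimal OpenCL C sketch. The kernel name and arguments below are assumptions made for illustration only and are not part of this disclosure; the point is that every work-item (execution stream) runs the same core code while selecting its own data using its global ID.

    /* Each execution stream runs this same core code but, per the SPMD
     * model, operates on its own element selected by its global ID. */
    __kernel void scale_core(__global const float *input,
                             __global float *output,
                             float factor) {
        int gid = get_global_id(0);  /* unique index of this execution stream */
        output[gid] = input[gid] * factor;
    }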
[0023] Although GPUs provide massive parallelism for processing, GPUs may not be well suited to running cores in a pipelined fashion. Running cores in a pipelined fashion means running cores such that the data produced by one core is consumed by another core. As another example, running cores in a pipelined fashion means running an execution stream of a core that produces data that will be consumed by another execution stream of the same core. In this disclosure, an execution stream that produces the data can be referred to as a producer execution stream, and the execution stream that receives the data can be referred to as a consumer execution stream.
[0024] In some examples, the producer execution stream and the consumer execution stream can be execution streams of the same core. In some examples, the producer execution stream and the consumer execution stream might be execution streams of different cores. In these examples, the core that includes the producer execution stream may be referred to as the producer core, and the core that includes the consumer execution stream may be referred to as the consumer core.
[0025] For example, running cores in a pipelined manner can be thought of as a first execution stream (a producer execution stream of a core, for example) producing data that is consumed by a second execution stream (a consumer execution stream of the same core or of a different core, for example). This second execution stream (which was a consumer for the first execution stream) can be a producer execution stream for a third execution stream (the second execution stream produces data that will be consumed by the third execution stream, for example). The third execution stream can be an execution stream of a core different from the core or cores that include the first and second execution streams, or it can be an execution stream of one of the cores that include the first and second execution streams. In this example, the first, second, and third execution streams can be thought of as forming a processing pipeline.
[0026] The execution of cores in a pipelined manner should not be interpreted as requiring that the cores or execution streams be executed serially (one after the other, for example). In the example above, for instance, it is possible for the GPU to run two or more of the first, second, and third execution streams in parallel (at the same time, for example). However, it is also possible for the GPU to run the execution streams serially and still be considered to run the cores in a pipelined fashion.
[0027] It may be necessary for a programmable computing unit, which executes a producer execution stream of a core, to transmit the produced data to global memory (i.e., off-chip system memory external to the integrated circuit (IC) that includes the GPU), where the global memory can be accessible via a system bus, for example. Another programmable computing unit, which runs a consumer execution stream of the same or a different core, may need to retrieve the produced data from global memory. As described in more detail, for existing GPUs, global memory management can be inefficient in terms of computation, time, and/or energy, which results in poor performance when running cores in a pipelined fashion.
[0028] This disclosure describes techniques for computationally, time-, and energy-efficient management of global memory. As described in more detail, the integrated circuit (IC) that includes the GPU may include a pipeline management unit (PMU). Alternatively, the GPU itself can include the PMU. The PMU can be configured to manage the state information of the global memory that stores the produced data that will be consumed. For example, a processor or the GPU itself can reserve locations within global memory where the data produced by the programmable computing units will be stored. These reserved locations within global memory can be thought of as a series of stores. In some examples, the series of stores may form an annular store or a first in, first out (FIFO) store. An annular store can be considered an example of a FIFO store.
[0029] The PMU can store information inside the IC or inside the GPU (in an on-chip cache memory, for example) that indicates the state of the stores in the off-chip global memory. As an example, the PMU can store information indicating the start address and end address of the stores in global memory.
As another example, the PMU can store the address of the store within the series of stores into which produced data will be written, as well as the address of the store within the series of stores from which data to be consumed will be read. As yet another example, the PMU can store information that indicates whether a producer core has completed producing data, so that the programmable computing unit that is running an execution stream of the consumer core that needs the data can proceed with the execution of other consumer core execution streams that do not need the data.
[0030] In the techniques described in this disclosure, the PMU may receive a request to store the data produced by a producer execution stream in the store, and may receive a request to retrieve the data produced by the producer execution stream from the store for consumption by a consumer execution stream. The PMU can determine the location within the store where the data produced by the execution of the producer execution stream will be stored, based on the stored state information of the store, and can determine the location within the store from which the data to be consumed by the consumer core will be retrieved, based on the stored state information of the stores.
[0031] By managing global memory state information with information stored within the IC that includes the GPU, or within the GPU itself, the techniques described in this disclosure can minimize the number of times the GPU needs to access global memory. For example, the PMU can determine the addresses at which data will be stored or from which data will be retrieved by accessing such information within the IC that includes the GPU, rather than by accessing global memory. Minimizing the number of times the GPU needs to access global memory can reduce power consumption, reduce the bandwidth load on the system bus, and reduce latency.
[0032] Furthermore, as described in more detail below, existing GPUs require the cores to include instructions that manage global memory. The GPU may waste clock cycles executing such global memory management instructions, which can be computationally inefficient. With the PMU managing global memory state information, it may not be necessary for the cores to include global memory management instructions, which results in less complex core instructions as well as fewer core instructions that need to be executed. In this way, the techniques described in this disclosure can promote computational efficiency.
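The on-chip state information described in the preceding paragraphs can be sketched as a small C structure. The layout and field names below are illustrative assumptions rather than the disclosure's specified implementation; they merely show how start/end addresses, read/write locations, and a counter could be kept inside the IC.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical on-chip state record for one store (FIFO or annular
     * store) residing in global memory. */
    typedef struct {
        uint64_t start_addr;     /* start address of the store in global memory */
        uint64_t end_addr;       /* end address of the store in global memory */
        uint32_t elem_size;      /* width: bytes per storage location */
        uint32_t capacity;       /* length: number of storage locations */
        uint32_t read_idx;       /* next location to be consumed */
        uint32_t write_idx;      /* next location to receive produced data */
        uint32_t count;          /* atomic counter: elements available */
        bool     producer_done;  /* producer core finished producing */
    } store_state_t;

    /* Determine the global-memory address at which produced data will be
     * stored, advancing the write index with annular (ring) wrap-around. */
    uint64_t reserve_write_location(store_state_t *s) {
        uint64_t addr = s->start_addr + (uint64_t)s->write_idx * s->elem_size;
        s->write_idx = (s->write_idx + 1) % s->capacity;
        return addr;
    }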
[0033] Figure 1 is a block diagram showing an example of an apparatus according to one or more examples described in this disclosure. For example, Figure 1 shows an apparatus 10. Examples of apparatus 10 include, but are not limited to, video devices such as media players, set-top boxes, wireless handsets such as mobile telephones, personal digital assistants (PDAs), desktop computers, laptop computers, game consoles, video conferencing units, tablet computing devices, and the like. Apparatus 10 may include components in addition to those shown in Figure 1.
[0034] As shown, apparatus 10 includes an integrated circuit (IC) 12 and a global memory 20. Global memory 20 may be considered the memory for apparatus 10. Global memory 20 may comprise one or more computer-readable media. Examples of global memory 20 include, but are not limited to, random access memory (RAM), electrically erasable programmable read-only memory (EEPROM), flash memory, or any other medium that can be used to carry or store desired program code in the form of instructions and/or data structures and that can be accessed by a computer or a processor.
[0035] In some respects, global memory 20 may include instructions that cause processor 14 and/or GPU 16 to perform the functions assigned to processor 14 and GPU 16 in this disclosure. Therefore, global memory 20 may be a computer-readable storage medium having instructions stored on it that, when executed, cause one or more processors (processor 14 and GPU 16, for example) to perform various functions.
[0036] Global memory 20 can be considered, in some examples, a non-transitory storage medium. The term "non-transitory" may indicate that the storage medium is not embodied in a carrier wave or propagated signal. However, "non-transitory" should not be interpreted to mean that global memory 20 is immobile or that its contents are static. As an example, global memory 20 can be removed from apparatus 10 and moved to another apparatus. As another example, a global memory substantially similar to global memory 20 can be inserted into apparatus 10. In certain examples, a non-transitory storage medium can store data that can change over time (in a RAM, for example).
[0037] IC 12 includes a processor 14, a graphics processing unit (GPU) 16, and a pipeline management unit (PMU) 18. IC 12 can be any type of integrated circuit that houses or forms processor 14, GPU 16, and PMU 18. For example, IC 12 can be thought of as a processing chip within a chip package. PMU 18 can be a hardware unit that forms part of IC 12, or it can be hardware inside GPU 16. It is also possible for PMU 18 to be software running on hardware inside IC 12 or inside GPU 16. For purposes of illustration and description, the techniques are described with respect to PMU 18 being a hardware unit.
[0038] Although processor 14, GPU 16, and PMU 18 are shown as being part of a single IC 12, aspects of this disclosure are not so limited. In some examples, processor 14 and GPU 16 may be housed in different integrated circuits (i.e., different chip packages). In these examples, PMU 18 can be housed on the same integrated circuit as GPU 16. In some examples, PMU 18 can be formed as part of GPU 16. As an example, processor 14 and GPU 16 can be housed in the same integrated circuit (that is, in the same chip package), and PMU 18 can be formed inside GPU 16. As another example, processor 14 and GPU 16 can be housed in different integrated circuits (i.e., in different chip packages), and PMU 18 can be formed inside GPU 16.
[0039] Examples of processor 14, GPU 16, and PMU 18 include, but are not limited to, a digital signal processor (DSP), a general-purpose microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other integrated or discrete logic circuitry. In some examples, GPU 16 and PMU 18 may be specialized hardware that includes integrated and/or discrete logic circuitry that provides GPU 16 with massively parallel processing capabilities suitable for graphics processing and that provides PMU 18 with the management of global memory 20, as described in more detail below.
In some examples, GPU 16 may also include general-purpose processing capability and may be referred to as a general-purpose GPU (GPGPU) when implementing general-purpose processing tasks (i.e., non-graphics-related tasks).
[0040] Processor 14, sometimes referred to as the host, can be the central processing unit (CPU) of apparatus 10. Processor 14 can run various types of applications. Examples of applications include web browsers, e-readers, email applications, spreadsheets, video games, video playback, audio playback, word processing, other applications that generate viewable objects for display, or any other types of applications. Global memory 20 can store instructions for running the application or applications.
[0041] In some examples, processor 14 can offload processing tasks to GPU 16, such as tasks that require massively parallel operations. As an example, graphics processing requires massively parallel operations, and processor 14 can offload such graphics processing tasks to GPU 16. In some examples, processor 14 can offload tasks that are unrelated to graphics processing to GPU 16. For example, matrix operations require parallel operations, and GPU 16 may be better suited than processor 14 to implement such operations.
[0042] To implement tasks, GPU 16 can be configured to run one or more applications. For graphics-related processing, for example, GPU 16 can run applications such as vertex shaders, fragment shaders, and geometry shaders. For non-graphics-related processing, GPU 16 can run applications designed for such processing (such as an application that implements matrix operations or an application for fluid dynamics). For both examples (graphics-related processing and non-graphics-related processing, for example), processor 14 can instruct GPU 16 to run the application or applications.
[0043] Processor 14 can communicate with GPU 16 according to a particular application programming interface (API). For example, processor 14 can transmit instructions to GPU 16, such as instructions that instruct GPU 16 to run one or more applications, using the API. Examples of such APIs include the DirectX® API from Microsoft®, OpenGL® from the Khronos Group, and OpenCL® from the Khronos Group; however, aspects of this disclosure are not limited to the DirectX, OpenGL, or OpenCL APIs and may be extended to other types of APIs that have been developed, are currently being developed, or will be developed in the future. Furthermore, it is not necessary for the techniques described in this disclosure to work in accordance with an API, and processor 14 and GPU 16 can use any technique for communication.
[0044] As an example, for graphics-related applications, processor 14 can communicate with GPU 16 using the OpenGL API. For non-graphics-related applications, processor 14 can communicate with GPU 16 using the OpenCL API. Again, the techniques described in this disclosure do not necessarily require processor 14 to communicate with GPU 16 using the OpenGL and/or OpenCL APIs.
[0045] The graphics-related applications that GPU 16 runs can be referred to as shaders, and the non-graphics-related applications that GPU 16 runs can be referred to as cores. For example, global memory 20 can store the shader and core instructions, and a compiler running on processor 14 can convert the shader and core instructions into object code for execution on GPU 16.
As another example, global memory 20 can store the shader and core object code that GPU 16 retrieves and executes.
[0046] Examples of shaders include the vertex shader, the fragment shader, and the geometry shader for graphics-related processing. Examples of cores include applications that are unrelated to graphics processing (for linear algebra or fluid dynamics, for example). As additional examples, the cores include applications for image processing and video processing.
[0047] GPU 16 can include an operations-specific graphics processor, and the operations-specific graphics processor can run the shaders and cores. For example, the operations-specific graphics processor of GPU 16 may include one or more shading cores (referred to as programmable computing units), and each of the one or more programmable computing units can run a core.
[0048] Although cores are described as applications that run on GPU 16, cores should not be considered so limited. Other examples of cores include fixed-function units of GPU 16. For example, GPU 16 includes programmable computing units and fixed-function units. The programmable computing units can provide functional flexibility when running applications. The fixed-function units can be hardware units that do not provide functional flexibility and that can be designed for specific purposes. In general, the term core refers to any application or hardware unit that receives data, processes the data, and transmits the data for non-graphics purposes. For purposes of example, however, the techniques described in this disclosure are described with examples in which the cores are applications, it being understood that these techniques are extensible to examples in which the cores are fixed-function units.
[0049] In the techniques described in this disclosure, instead of one programmable computing unit executing all the instructions of a core, it is possible for several programmable computing units to execute parts of the core. A part of a core can be referred to as a workgroup, task, or execution stream (all of which are synonymous). For example, a workgroup, task, or execution stream of a core is a set of instructions that can be executed independently of the core's other workgroups, tasks, or execution streams.
[0050] In some examples, a first set of one or more programmable computing units may execute one execution stream of a core, and a second set of one or more programmable computing units may execute another execution stream of a core. In some cases, the execution streams that the first set of programmable computing units and the second set of programmable computing units execute may be execution streams of the same core. In some cases, the execution streams that the first set of programmable computing units and the second set of programmable computing units execute may be execution streams of different cores. In either of these examples, it may be necessary for one of the execution streams to transmit the data it produces to another of the execution streams. In other words, GPU 16 can run the cores in a pipelined fashion.
[0051] As described above, for example, running cores in a pipelined manner can mean running cores such that the data produced by one execution stream is consumed by another execution stream, and the data produced by this other execution stream is consumed by yet another execution stream, and so on.
In these examples, the execution streams might be execution streams of different cores, execution streams of the same core, or some execution streams might be for different cores while other execution streams are for the same core. In these examples, the cores can be seen as forming a pipeline in which data is produced and consumed. For example, first, second, and third execution streams of the same or different cores can form a pipeline, in which the first execution stream produces data and passes the data to the second execution stream for consumption and processing. The second execution stream processes the received data to produce data and passes the produced data to the third execution stream for processing, and so on.
[0052] In this example, the first execution stream can be referred to as a producer execution stream; the second execution stream can be referred to as a consumer execution stream for the first execution stream and a producer execution stream for the third execution stream; and the third execution stream can be referred to as a consumer execution stream. In examples where the first, second, and third execution streams are for different cores (the first, second, and third cores, respectively, for example), the first core can be referred to as the producer core, the second core can be referred to as the consumer core for the first core and the producer core for the third core, and the third core can be referred to as the consumer core.
[0053] In existing GPUs, running cores in a pipelined manner can be inefficient in terms of computation and power. For example, each of the programmable computing units may include a local memory for storing instructions that will be executed by the programmable computing unit, for storing data that will be processed, and for storing data that is produced, including intermediate results. However, the local memory of a programmable computing unit may not be accessible by any other programmable computing unit.
[0054] Therefore, in some examples, to run cores in a pipelined manner, GPU 16 can retrieve produced data stored in the local memory of a programmable computing unit and store that data in global memory 20. Storing data in global memory 20 can be referred to as storing data off-chip, since global memory 20 is external to the integrated circuit that houses GPU 16 (i.e., external to IC 12). GPU 16 can then retrieve the data stored in global memory 20 and load the retrieved data into the local memory of another programmable computing unit.
[0055] As an illustrative example, suppose that a first programmable computing unit is executing an execution stream of a producer core. In this example, the first programmable computing unit can store the data produced by executing the execution streams of the producer core in the local memory of the first programmable computing unit. GPU 16 can retrieve the produced data from the local memory of the first programmable computing unit and store the produced data in global memory 20.
[0056] In this example, suppose that a second programmable computing unit is executing an execution stream of a consumer core. In this example, GPU 16 can retrieve the data produced by the producer core from global memory 20 and load the data into the local memory of the second programmable computing unit. The consumer core can then consume the data stored in the local memory of the second programmable computing unit.
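A minimal OpenCL C sketch of such a producer/consumer pair follows. The kernel names, the one-element-per-work-item mapping, and the use of a bare array for the store are simplifying assumptions; in this disclosure, the location of each element and the availability of the data would be managed by PMU 18 rather than by the kernels themselves, and the synchronization between the two cores is omitted here.

    /* Producer core: each execution stream writes one element into a
     * store in global memory. */
    __kernel void producer_core(__global const float *input,
                                __global float *store) {
        int gid = get_global_id(0);
        store[gid] = input[gid] * 2.0f;   /* produce data into the store */
    }

    /* Consumer core: each execution stream reads one produced element
     * from the store and processes it. */
    __kernel void consumer_core(__global const float *store,
                                __global float *output) {
        int gid = get_global_id(0);
        output[gid] = store[gid] + 1.0f;  /* consume the produced data */
    }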
[0057] In the above example, GPU 16 may need to store the data produced by the producer core in global memory 20, since the second programmable computing unit does not have access to the local memory of the first programmable computing unit. In this way, global memory 20 functions as intermediate storage for produced data that will subsequently be consumed.
[0058] In general, managing the way in which produced data is stored in, and the way in which data is retrieved from, global memory 20 can be inefficient in terms of processing and computation. As an example, it is possible, albeit inefficient, for the cores to manage the way in which data is stored in global memory 20. For example, the cores might include instructions that make the arithmetic logic units (ALUs) of the programmable computing units determine the addresses (pointers, for example) within global memory 20 where the data will be stored or from which stored data will be retrieved.
[0059] As another example, global memory 20 can store an atomic counter. The atomic counter value can indicate whether data is available for consumption. For example, the producer core may include instructions to read the current atomic counter value stored in global memory 20. The producer core may also include instructions that modify the atomic counter value based on the amount of data the producer core has stored, and instructions that write the modified atomic counter value back to global memory 20.
[0060] The consumer core can include instructions to periodically check the atomic counter value stored in global memory 20. When the atomic counter value is sufficiently large, the consumer core can determine that the data to be consumed is available. Suppose, for example, that the value of the atomic counter is X and that the producer core produced N units of data. In this example, the consumer core can include instructions that cause the programmable computing unit, which is executing the consumer core's execution streams, to periodically check the value of the atomic counter. When the programmable computing unit determines the value of the atomic counter to be X + N, the programmable computing unit can request the GPU to retrieve the data stored in global memory 20 for consumption.
[0061] In this way, it is possible, with the use of software (i.e., the core instructions), to run the cores in a pipelined manner. However, there are a number of reasons why running the cores in a pipelined fashion using instructions within the core is inefficient. For example, adding instructions to the cores to determine the addresses at which to store data, or at which data is stored, in global memory 20 may require the ALUs of the programmable computing units to consume energy unnecessarily, as well as waste clock cycles processing the instructions that determine the addresses within global memory 20.
[0062] In addition, periodically checking the atomic counter value requires GPU 16 to access information off-chip (i.e., in global memory 20). Reading the atomic counter value from global memory 20 and writing the modified atomic counter value to global memory 20 can consume an undesirable amount of energy. In addition, as shown, IC 12 is coupled to global memory 20 via memory bus 24. There may be bandwidth limitations on the amount of data that memory bus 24 can carry, so there may be delays in when GPU 16 can read and write the atomic counter value.
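The atomic-counter mechanism of paragraphs [0059] and [0060], which this disclosure identifies as inefficient, can be sketched in OpenCL C as follows. Kernel and argument names are assumptions for illustration; the point is that both the producer's read-modify-write of the counter and the consumer's periodic check go off-chip to global memory.

    /* Producer side: write data, then bump the counter in global memory. */
    __kernel void producer_core(__global float *store,
                                __global const float *input,
                                volatile __global int *atomic_counter) {
        int gid = get_global_id(0);
        store[gid] = input[gid] * 2.0f;   /* produce one element */
        atomic_inc(atomic_counter);       /* off-chip read-modify-write */
    }

    /* Consumer side: "spin" on the counter until enough data exists. */
    __kernel void consumer_core(__global const float *store,
                                __global float *output,
                                volatile __global int *atomic_counter,
                                int needed) {
        int gid = get_global_id(0);
        /* The busy/wait state criticized in this disclosure: each check
         * is another off-chip access to global memory. */
        while (atomic_add(atomic_counter, 0) < needed) {
            /* spinning: no useful work is performed */
        }
        output[gid] = store[gid] + 1.0f;  /* consume */
    }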
[0063] Furthermore, since the time at which the data becomes available for consumption by the consumer core is unknown, the programmable computing unit that runs the consumer core can periodically cause GPU 16 to check the value of the atomic counter in order to determine whether the data is available for consumption. Periodically checking the atomic counter value can cause the consumer core's execution streams to remain "spinning". For example, if the value read from the atomic counter indicates that the data is not yet completely available for consumption, the programmable computing unit may stall execution of the consumer core's execution streams until the programmable computing unit checks the value of the atomic counter again. If the data is still not available, the programmable computing unit waits again and causes GPU 16 to check again whether the data is available. In this example, the consumer core's execution streams can remain in this busy/wait state for as long as the data to be consumed is not available in global memory 20. In other words, while spinning, the programmable computing unit may not be performing any useful work, which can slow down data consumption.
[0064] If the frequency at which the programmable computing unit determines whether data is available (by reading the atomic counter value, for example) is high, then GPU 16 can waste energy by frequently reading the atomic counter value stored in global memory 20. If the frequency at which the programmable computing unit determines whether data is available is low, then there may be wasted time between when the data becomes available and when GPU 16 retrieves the data, which also slows down data consumption.
[0065] Furthermore, in some of the above techniques in which global memory 20 stores the atomic counter, while one core is reading, modifying, and writing the value of the atomic counter, no other core is authorized to read, modify, or write the value of the atomic counter. In such cases, when two producer execution streams need to transmit data for storage in global memory 20 at the same time, one of the execution streams will be able to transmit its data, but the other execution stream may not be able to, since this other execution stream might not be able to access the atomic counter. In such cases, the execution stream that was denied storage access can spin until access to the atomic counter becomes available and, when the atomic counter is available for access, the execution stream that had been denied access can write its data to global memory 20. The same can happen when two consumer execution streams try to access data at the same time.
[0066] The techniques described in this disclosure may allow GPU 16 to run cores in a pipelined manner more efficiently than the techniques described above. As described in more detail, the pipeline management unit (PMU) 18 can be configured to store state information about the data that is produced by the various execution streams and the data that will be consumed by the various execution streams. In this way, GPU 16 may not need to continually access off-chip information that indicates where data is stored and when data is available for consumption. Instead, PMU 18 can store such information internally (i.e., inside IC 12).
[0067] As shown, global memory 20 may include stores 22A-22N (collectively referred to as stores 22). Stores 22 may be storage locations within global memory 20.
Examples of stores 22 include a first in, first out (FIFO) store and an annular store.
[0068] Processor 14 may be configured to define the number of stores that reside within global memory 20 and to reserve storage locations within global memory 20. For example, processor 14 may define the start and end locations of stores 22 (i.e., the start and end addresses). Processor 14 can define the number of stores that reside within global memory 20 based on the number of programmable computing units that reside within the operations-specific graphics processor of GPU 16. As an example, processor 14 can define the number of stores that reside within global memory 20 such that there are one or more input stores 22 for each programmable computing unit (i.e., one or more stores that store data to be consumed by cores running on the programmable computing units) and zero or more output stores 22 for each programmable computing unit (i.e., zero or more stores that store data produced by cores running on the programmable computing units of GPU 16).
[0069] In addition, processor 14 can be configured to define the size of the stores. For example, processor 14 can be configured to define the number of storage locations within each of stores 22 (the length of stores 22, for example). Processor 14 can also define the amount of data that each of the storage locations can hold (the width of stores 22, for example). In some examples, processor 14 may pre-populate stores 22 with data.
[0070] In some examples, processor 14 can be configured to define a minimum number of stores 22. As an example, processor 14 can be configured to define a minimum of 128 stores 22. The minimum number of 128 stores 22 is presented for purposes of illustration and should not be considered limiting. The minimum number of stores 22 may be greater or less than 128. In some examples, there may be no requirement for a minimum number of stores 22.
[0071] Processor 14 may also be configured to execute various instructions to determine the condition of stores 22. For example, processor 14 can execute instructions that copy the data stored in stores 22 into stores within IC 12 or GPU 16, and instructions that copy the data stored in stores within IC 12 or GPU 16 into stores 22. Processor 14 may also execute instructions that define the amount of data stored in stores 22, as well as instructions that confirm the length and width of stores 22 (in order to ensure that stores 22 have not been corrupted, for example). Such execution of instructions that allow processor 14 to determine the condition of stores 22 is not necessary in every example, but could potentially help the core developer determine the condition of stores 22 by executing instructions on processor 14 rather than on GPU 16.
[0072] In some examples, processor 14 can be configured to define an amplification factor for stores 22. The amplification factor can indicate the maximum number of elements that can be produced by an execution stream of a core for storage in one of stores 22. The amplification factor may be needed for situations in which the one of stores 22 that will store the data cannot store all of the data produced. This can result in a core's execution halting due to insufficient storage space in stores 22 and can lead to deadlock (in the event the core never returns to the running state, for example).
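The host-side definition of stores 22 described in paragraphs [0068] through [0070] can be sketched in C. The pmu_define_stores() call and the store_config_t structure below are hypothetical, invented for illustration only; the disclosure does not specify this API.

    #include <stdint.h>

    /* Hypothetical description of the stores that processor 14 defines. */
    typedef struct {
        uint32_t num_stores;     /* number of stores in global memory */
        uint32_t length;         /* storage locations per store */
        uint32_t width;          /* bytes per storage location */
        uint32_t amplification;  /* max elements one execution stream may
                                    produce, per paragraph [0072] */
    } store_config_t;

    /* Hypothetical call that reserves the global-memory regions and hands
     * the start/end addresses of the stores to PMU 18. */
    extern int pmu_define_stores(const store_config_t *cfg);

    int configure_stores(void) {
        store_config_t cfg = {
            .num_stores    = 128,           /* example minimum from [0070] */
            .length        = 1024,          /* assumed length */
            .width         = sizeof(float), /* assumed width */
            .amplification = 4,             /* assumed amplification factor */
        };
        return pmu_define_stores(&cfg);
    }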
[0073] To minimize the possibility of such a deadlock, processor 14 can reserve large portions of global memory 20 (by defining stores 22 long and wide enough to store almost any amount of data, for example). This may work well in some cases but not in others, where reserving large portions of global memory 20 may not be possible. In some cases, developers can design the cores so that the cores do not overproduce data, thereby minimizing the chances of deadlock.
[0074] Although processor 14 is described as defining stores 22, the techniques described in this disclosure are not so limited. In some examples, a processing unit other than processor 14 can be configured to define stores 22. In some examples, it is possible for GPU 16 to define stores 22. However, for ease of description, the techniques are described with respect to processor 14 defining stores 22.
[0075] Processor 14 can transmit the information about stores 22 to the pipeline management unit (PMU) 18. For example, PMU 18 can receive information indicating the number of stores 22, the start and end addresses of stores 22, the length and width of stores 22, and any other information that processor 14 has determined for stores 22. PMU 18 can store such information about the state of stores 22 in registers located within IC 12. With the information about stores 22 from processor 14, PMU 18 can be configured to manage the state information of stores 22 as execution streams of the cores executing on the programmable computing units produce and consume data.
[0076] For example, after a programmable computing unit, which executes execution streams of a core, produces data and transmits the produced data, PMU 18 can receive the data and determine the address where the data will be stored. For example, PMU 18 can determine in which of stores 22 to store the data. In examples where stores 22 are annular stores or FIFO stores, PMU 18 can store the information for pointers that identify the beginning and end of stores 22. For annular stores, PMU 18 can also store the information for pointers that identify the beginning of the valid data and the end of the valid data.
[0077] Therefore, instead of the cores including instructions that make the programmable computing units determine the addresses at which produced data will be stored and from which data will be retrieved for consumption, PMU 18 can be configured to determine the addresses at which data will be stored and from which data will be retrieved for consumption. In this way, GPU 16 may not waste clock cycles, and the ALUs of the programmable computing units may not waste processing power, determining the addresses at which data will be stored or from which data will be retrieved.
[0078] Furthermore, PMU 18 can be configured to determine when the data to be consumed is ready for consumption. For example, instead of global memory 20 storing an atomic counter, PMU 18 can store the atomic counter locally within IC 12 (in registers in a local cache memory within IC 12, for example). As an example, when a programmable computing unit, which executes a producer execution stream, transmits data, PMU 18 can read the atomic counter value stored internally, modify the atomic counter value based on the amount of data produced, and write the modified atomic counter value back inside IC 12.
In this example, when the programmable computing unit that executes the consumer execution stream reads the atomic counter value, GPU 16 may not need to determine the atomic counter value by accessing off-chip global memory 20. Instead, PMU 18 can provide the atomic counter value.
[0079] In some examples, PMU 18 storing the atomic counter value locally can reduce spinning. For example, a programmable computing unit that executes a consumer execution stream can transmit a request for data that will be consumed by the consumer execution stream. In this example, PMU 18 can determine whether the data to be consumed is available (based on the locally stored atomic counter value, for example).
[0080] If PMU 18 determines that the data is not yet available for consumption, PMU 18 may indicate to the programmable computing unit that the programmable computing unit should switch to a different execution stream (of the same core or possibly of a different core, for example) that is not waiting on data that is not yet available. In other words, PMU 18 may indicate that the consumer execution stream that needs the not-yet-available data should be put to sleep so that the programmable computing unit can continue executing other execution streams. Then, when the data becomes available, as determined by PMU 18 based on the locally stored value of the atomic counter, PMU 18 can instruct the programmable computing unit to switch back to the sleeping execution stream (i.e., wake up the execution stream) so that the programmable computing unit can execute the consumer execution stream using the now-available data. In this way, when the data is not yet available for consumption, the programmable computing unit that executes the consumer execution streams is able to execute other core execution streams instead of remaining in a busy/wait state.
[0081] As another example, when two producer execution streams of the same core, executed on different programmable computing units, try to write data to the same store 22 at the same time, PMU 18 can allow access to one of the producer execution streams and deny access to the other producer execution stream. In this example, PMU 18 can instruct the programmable computing unit that executes the execution stream that was denied access to execute other core execution streams. When write access to stores 22 becomes available, as determined by PMU 18, PMU 18 may indicate to the programmable computing unit that was executing the execution stream that was denied access that write access to stores 22 is now available. In this way, the programmable computing unit that executes the execution stream that was denied access is able to execute additional execution streams in the meantime.
[0082] In the same way, when two consumer execution streams try to read data from the same store 22 at the same time, PMU 18 can allow access to one of the consumer execution streams and deny access to the other consumer execution stream. Similar to the example where two execution streams write at the same time, in this example where two execution streams read at the same time, PMU 18 can instruct the programmable computing unit that executes the execution stream that was denied access to execute other execution streams. When read access to stores 22 becomes available, as determined by PMU 18, PMU 18 may indicate to the programmable computing unit that was executing the execution stream that was denied access that read access to stores 22 is now available.
In this way, the programmable computing unit that executes the execution stream that was denied access is able to execute additional execution streams while it waits.
[0083] In this manner, processor 14 defining stores 22 in global memory 20, and PMU 18 managing the state of stores 22 in global memory 20, can allow efficient pipelined execution of cores by GPU 16. As an example, PMU 18 can minimize the number of off-chip accesses needed to run cores in a pipelined manner. As another example, since PMU 18 can determine the addresses at which data is to be stored or from which data is to be retrieved, GPU 16 may not waste energy and clock cycles executing instructions within the cores to determine such addresses. In other words, PMU 18 can determine the addresses at which data will be stored, and from which data will be retrieved, without the execution streams including instructions for determining where the data will be stored and from where it will be retrieved. In addition, PMU 18 can allow the programmable computing units to run execution streams of the cores without spinning. For example, when data from a producer core is not yet available, PMU 18 can allow other execution streams of a consumer core (execution streams that do not require the producer core's data, for example) to be executed.
[0084] Figure 2 is a block diagram showing a graphics processing unit (GPU) and a global memory in more detail. For example, Figure 2 shows GPU 16 and global memory 20 of Figure 1 in more detail. As shown, GPU 16 includes an operations-specific graphics processor 26, fixed-function units 30, a pipeline management unit (PMU) 18, a cache 34, a scheduler 40, and registers 44. In some examples, registers 44 may be part of cache 34. In the example shown in Figure 2, PMU 18 is shown as being formed inside GPU 16. As described above, however, PMU 18 can be formed outside GPU 16 and on the same integrated circuit as GPU 16.
[0085] The operations-specific graphics processor 26 may include programmable computing units 28A-28N (collectively referred to as programmable computing units 28), which may be thought of as shading cores. Fixed-function units 30 include fixed-function computing units 32A-32N (collectively referred to as fixed-function computing units 32). The operations-specific graphics processor 26 and the fixed-function units 30 may include more or fewer programmable computing units 28 and fixed-function computing units 32 than shown.
[0086] The programmable computing units 28 can function as described above. For example, the programmable computing units 28 can run both graphics-related and non-graphics-related applications (shaders and cores, for example). For example, the programmable computing units 28 can run cores that are written in a device language (the OpenCL C language, for example). As described above, each of the programmable computing units 28 may include local memory for storing intermediate results and for sharing data between the execution streams of a core executed on that programmable computing unit 28. The local memory of each of the programmable computing units 28 may not be accessible by the other programmable computing units 28. In some examples, it is possible for one of the programmable computing units 28 to schedule the time at which another of the programmable computing units 28 will execute execution streams of a core.
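The per-unit local memory of paragraph [0086] can be illustrated with a standard OpenCL C reduction sketch; the kernel below is an assumed example, not taken from this disclosure. The __local scratch array is shared only by the execution streams of the core running on one programmable computing unit, which is why pipelined communication between units must instead go through stores in cache 34 or global memory 20.

    /* Execution streams of one work-group share on-unit local memory that
     * other programmable computing units cannot access. Assumes a
     * power-of-two work-group size. */
    __kernel void reduce_core(__global const float *input,
                              __global float *partial_sums,
                              __local float *scratch) {
        int lid = get_local_id(0);
        int gid = get_global_id(0);

        scratch[lid] = input[gid];        /* stage data in local memory */
        barrier(CLK_LOCAL_MEM_FENCE);     /* sync this unit's streams */

        /* Tree reduction over the shared local memory. */
        for (int stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)                     /* one stream writes the result */
            partial_sums[get_group_id(0)] = scratch[0];
    }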
[0087] In some instances, one of the programmable computing units 28 may transmit data to one or more other programmable computing units 28. For example, to run cores in a pipelined manner, a first one of the programmable computing units 28, which executes a producer execution stream, can transmit data (non-graphics-related data, for example) to a second one of the programmable computing units 28. As described above, the transmitting one of the programmable computing units 28 (the programmable computing unit that executes the producer execution stream, for example) may store the data in a store, such as one of stores 22 in global memory 20, and the receiving one of the programmable computing units 28 (the programmable computing unit that executes the consumer execution stream, for example) can retrieve the data from that one of stores 22 in global memory 20.
[0088] As shown in Figure 2, in some examples GPU 16 may include an internal cache 34. However, cache 34 may be internal to IC 12 rather than limited to being internal to GPU 16. In some examples, instead of storing produced data off-chip (in global memory 20, for example), it is possible for GPU 16 to also store data internal to GPU 16 or IC 12. For example, the transmitting one of the programmable computing units 28 may store the data in one or more of stores 36A-36N (collectively referred to as stores 36) in cache 34, which is inside GPU 16 in the example of Figure 2 but can be inside IC 12 and outside GPU 16. The receiving one of the programmable computing units 28 can retrieve the data from stores 36 in cache 34. The stores within cache 34 can function as cache stores for stores 22 in global memory 20. In other words, stores 22 in global memory 20 can store all of the data produced by a producer execution stream that will be consumed by a consumer execution stream, and stores 36 can function as a cache that stores some of the produced data for quick access compared to accessing the data in global memory 20.
[0089] Stores 36 within cache 34 may be similar to stores 22. For example, stores 36 may be FIFO stores or annular stores. It is desirable for cache 34 to include stores 36 in order to avoid the memory latency and power consumption associated with accessing off-chip memory (stores 22 of global memory 20, for example). However, using only stores 36 may not be practical due to the limited space available for storage. In this example, it is possible to store some of the data within stores 36 and allow the rest to spill into stores 22.
[0090] Stores 36 and stores 22 can allow GPU 16 to run cores in a pipelined manner. For example, stores 36 and stores 22 can be thought of as data structures that provide communication between the programmable computing units 28. Stores 36 and stores 22 can be configured to store more than the minimum amount of data that cores running on the programmable computing units can transmit (more than one unit of data, for example). In this way, the execution streams of a core, executed on one of the programmable computing units 28, are capable of producing a variable amount of data that is stored in stores 36 and stores 22 and that can be passed to the execution streams of another core, running on another of the programmable computing units 28, for consumption.
[0091] The fixed-function computing units 32 can provide fixed functionality and can be formed as hardware units (as a non-limiting example).
[0091] The fixed function computing units 32 can provide fixed functionality and can be formed as hardware units (as a non-limiting example). The fixed function computing units 32 can be thought of as executing specific, built-in kernels. For example, while the programmable computing units 28 provide functional flexibility, the fixed function computing units 32 are limited in their respective functional flexibility. For example, the fixed function computing units 32 may include rasterization units, primitive assembly units, viewport transform units and other such units that provide specific graphics functionality.

[0092] In some instances, the fixed function computing units 32 can be hardwired to perform their respective specific functions. Furthermore, it is possible for one of the fixed function computing units 32 to schedule when another of the fixed function computing units 32 will be executed. Furthermore, in some cases, if the GPU 16 does not include a specific one of the fixed function computing units 32, it is possible to develop a kernel that performs the function of the unavailable fixed function computing unit. In other words, the kernel can emulate the fixed function behavior of the unavailable fixed function computing unit. For example, if a fixed function tessellation unit is not available, the developer can develop a tessellation kernel that emulates the fixed function behavior of tessellation and execute that kernel on one or more of the programmable computing units 28.

[0093] In some examples, the GPU 16 may include a scheduler 40. The scheduler 40 may assign execution streams and operations to the various programmable computing units 28 and fixed function units 32. For example, the scheduler 40 may balance the load of the tasks performed by the programmable computing units 28 such that none of the programmable computing units 28 is over-utilized while others are under-utilized. The scheduler 40 can be implemented as hardware or as software running on hardware.

[0094] In Figure 2, global memory 20 may include stores 42A-42N (collectively referred to as stores 42) and the cache 34 may include stores 38A-38N (collectively referred to as stores 38). The stores 38 may not necessarily be present in every example, and may form an optional embedded cache that provides cache-backing storage for the commands stored in the stores 42. The stores 42 and the stores 38 can be thought of as command queues. There may be one command queue (one of the stores 42 and the stores 38, for example) for all of the programmable computing units 28 and one queue for each type of fixed function computing unit 32. The stores 42 and the stores 38 may store zero or more entries.

[0095] The stores 42 and the optional embedded stores 38 can help organize the scheduling of the operational load for the programmable computing units 28 and the fixed function computing units 32. For example, the stores 42 can store commands that instruct the programmable computing units 28 and the fixed function computing units 32 to perform various tasks. For example, each entry in the stores 42 can store information to make one or more available programmable computing units 28 execute the execution streams of the kernels, as well as store kernel argument values and dependency information. In some instances, dependencies between the execution streams of a kernel may need to be satisfied before one or more of the programmable computing units 28 execute the kernel.
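Paragraph [0095] names the kinds of information a command queue entry carries but no concrete format; a minimal sketch, with every field name and size an assumption, might be:

    #include <stdint.h>

    /* Hypothetical layout of one entry in a command queue (one of the
     * stores 42 or 38); illustrative only. */
    typedef struct {
        uint32_t kernel_id;      /* which kernel's execution streams to run */
        uint32_t stream_count;   /* how many execution streams to launch    */
        uint64_t arg_values[8];  /* kernel argument values                  */
        uint32_t dep_count;      /* dependencies not yet satisfied          */
        uint32_t dep_ids[4];     /* entries that must complete first        */
    } command_queue_entry;

    /* An entry becomes dispatchable once its dependencies are satisfied. */
    static int entry_ready(const command_queue_entry *e)
    {
        return e->dep_count == 0;
    }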
[0096] The stores 22 can be accessible by either the processor 14 (Figure 1) or the GPU 16. As an example, the processor 14 can access the stores 22 using calls according to the various APIs described above. The GPU 16 can access the stores 22 based on the kernels running on the programmable computing units 28. For example, the kernels can be developed with functions that store the produced data in global memory 20.

[0097] As shown, the GPU 16 can also include the pipeline management unit (PMU) 18. As described above, the PMU 18 can manage the state of the stores 22 within global memory 20. In addition, the PMU 18 can manage the state of the stores 36 within the cache 34.

[0098] For example, the PMU 18 can manage the state of the stores 22 and the stores 36 by storing the length and width of the stores 22 and the stores 36, including the number of the stores 22 and the stores 36 that are available to store produced data. As an example, the PMU 18 can allocate the stores 22 before the kernels execute on the programmable computing units 28 and can deallocate the stores 22 when the kernels finish executing.

[0099] As another example, the PMU 18 can store information such as the head pointer, the current offset, the maximum depth and the like in the embedded registers 44. In some examples, the PMU 18 can store the state information of the stores 22 and the stores 36 in a manner similar to how texture parameters are stored in graphics processing.

[0100] The stores 22 may require management in order to determine which of the stores 22 to store data in or retrieve data from, to determine the locations within the stores where the data is to be stored or from which the data is to be retrieved (determining addresses, for example), and to ensure that different programmable computing units 28 do not attempt to access information in the stores in ways that cause data corruption. The PMU 18 can be in charge of such management. For example, with the GPU 16 including the PMU 18, or the IC that includes the GPU 16 including the PMU 18, the management of the stores 22 is located inside the IC that includes the GPU 16 and not outside it. This can result in reduced power consumption as well as efficient execution of the kernels running on the programmable computing units 28.

[0101] As an example, the PMU 18 can store an atomic counter inside the registers 44. The registers 44 can be part of the cache 34 or part of some other memory inside the GPU 16 or the IC 12. The atomic counter can indicate whether one of the stores 22 is available for access by one of the programmable computing units 28 (such as whether data is available for reading, or whether two or more kernels are trying to write to or read from the same one of the stores 22 at the same time). Based on the atomic counter, the PMU 18 is able to appropriately allow access to one of the programmable computing units 28 while denying access to another of the programmable computing units 28, so as to avoid corruption of the data in the stores 22, which could occur if two execution streams tried to write data at the same time. In some instances, when the PMU 18 denies access to one of the programmable computing units 28, the PMU 18 may allow the task requesting access (an execution stream, for example) to go into a wait state and allow the denied one of the programmable computing units 28 to continue performing other tasks (other execution streams, for example). When access becomes available for the denied programmable computing unit 28, the PMU 18 can wake up that task and provide the data to that task for further execution. In this way, the programmable computing units 28 need not be completely idle, and other tasks of the programmable computing units 28 may be performed.
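The atomic-counter arbitration of paragraph [0101] is performed by the PMU 18 in hardware; as a behavioral model only, it can be mimicked with OpenCL C atomics. In this sketch, "ready_count" stands in for the atomic counter held in the registers 44; the kernel, its names and its policy are all assumptions.

    // Behavioral sketch, not the disclosed hardware design.
    __kernel void guarded_consumer(__global const int *store22,
                                   volatile __global int *ready_count,
                                   __global int *out)
    {
        int gid = (int)get_global_id(0);

        // Atomic read of the counter: how many elements are safe to read.
        int avail = atomic_add(ready_count, 0);

        if (gid < avail) {
            out[gid] = store22[gid];  // access granted: consume one element
        }
        // Otherwise the PMU would park this execution stream and let the
        // unit run other streams, waking the task when data is available.
    }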
[0102] In some examples, when data must be retrieved from one of the stores 22 in global memory 20, the PMU 18 is capable of retrieving additional data beyond the requested data. For example, the PMU 18 can determine the starting and ending locations of the requested data. The PMU 18 can then retrieve additional data that is stored in the stores 22 beyond the determined ending location of the requested data. The PMU 18 can retrieve such additional data when the PMU 18 determines that storage space is available in the stores 36. As described above, the PMU 18 can manage both the stores 22 in global memory 20 and the stores 36 within the cache 34. The PMU 18 can then store the retrieved data in the cache 34. In this way, the additional data is already available within the GPU 16 when such data is needed. Storing additional data (data beyond the requested data, for example) in the stores 36 can also reduce the number of times the GPU 16 has to access non-embedded memory (global memory 20, for example).

[0103] To access data, the programmable computing units 28 can use pointers to access the stores (the kernels can be developed to access data using pointers, for example). In some instances, the PMU 18 may maintain pointer information so that the programmable computing units 28 are able to access the data properly. For example, the programmable computing units 28 may transmit specialized instructions to the PMU 18 requesting information about the stores 22. Such instructions may request the number of elements within the stores, how much data is stored within a store (the store width, for example), where the information is stored, and other such information. In this way, the act of ensuring that the programmable computing units 28 properly access the stores 22 can be performed internally to the IC that houses the GPU 16, which possibly reduces accesses external to the IC that houses the GPU 16.

[0104] As an example, in order to ensure that data does not become corrupted or lost, a producer kernel can be developed to include instructions that query the range of the stores 22 (the start and end points, for example). In this example, the one of the programmable computing units 28 that is running the producer kernel may transmit the query of the range of the stores 22 to the PMU 18. The PMU 18 may have stored the information about the range of the stores 22 in the registers 44 (by receiving such information from the processor 14 when the processor 14 defined the stores 22, for example). The PMU 18 can return the result of the query of the range of the stores 22 to the producer kernel.
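The disclosure gives the specialized store-information query of paragraphs [0103] and [0104] no concrete syntax; the following sketch invents a reply format and a range check purely for illustration, and every name in it is hypothetical.

    #include <stdint.h>

    /* Hypothetical reply to a store-information query: the PMU 18 would
     * answer from the registers 44 without any access outside the IC. */
    typedef struct {
        uint64_t start_addr;    /* first address of the store 22   */
        uint64_t end_addr;      /* last address of the store 22    */
        uint32_t num_elements;  /* elements currently in the store */
        uint32_t elem_width;    /* bytes per element (store width) */
    } store_query_reply;

    /* A range check a producer kernel might make before writing, so that
     * data is neither corrupted nor lost. */
    static int write_in_range(const store_query_reply *q,
                              uint64_t write_addr, uint32_t bytes)
    {
        return write_addr >= q->start_addr &&
               write_addr + bytes - 1 <= q->end_addr;
    }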
[0105] As another example, to run kernels in a pipeline, in some examples it may be necessary to maintain the order of the data in the pipeline. Suppose, for example, that a first kernel will produce data that will be consumed by a second kernel. In this case, however, it is possible that a third kernel is also executing during the same time that the first and second kernels are executing. In this case, it is possible that the data produced by the first kernel and the data produced by the third kernel become reordered, which could result in the consumption of incorrect data by the second kernel.

[0106] To ensure proper ordering, in some examples, in addition to the atomic counter that indicates whether the stores 22 are available for access, the PMU 18 can store additional atomic counters in the registers 44. These additional atomic counters may be referred to as device atomic counters. For example, there may be one device atomic counter associated with each of the stores 22. In addition, the PMU 18 or the scheduler 40 can be configured to assign to each execution stream of each kernel a token that defines the relative position at which the data produced by that execution stream will be stored in the stores 22. This token for an execution stream can be the current value of the device atomic counter.

[0107] For example, the PMU 18 can assign a first consumer execution stream, which will consume the data first, a token value of 0, assign a second consumer execution stream, which will consume the data second, a token value of 1, and so on. Each of these consumer execution streams can request the value of the device atomic counter from the PMU 18. If the current value of the device atomic counter is equal to the token value of the consumer execution stream, then the consumer execution stream can consume the data. Otherwise, the consumer execution stream cannot yet consume the data.

[0108] After the consumer execution stream whose token value equals the device atomic counter value consumes the data, the PMU 18 can update the value of the device atomic counter. In some examples, the amount of data that the consumer execution stream will consume may be fixed, and the PMU 18 may update the device atomic counter value after the fixed amount of data is retrieved from the stores 22. In other examples, however, the amount of data that the consumer execution stream will consume may not be fixed. In these examples, after the consumer execution stream has finished receiving the data, the consumer execution stream can indicate to the PMU 18 that the PMU 18 should increment the value of the device atomic counter so that the next consumer execution stream can consume the data. In this way, the device atomic counter, whose value the PMU 18 can store in the registers 44 and update, can ensure that the order in which the data will be consumed is preserved, and that consumer execution streams do not receive data out of turn.

[0109] As another example, the PMU 18 can store information in the registers 44 in order to reduce the possibility of deadlock. For example, as described above, the processor 14 can be configured to set an amplification factor for the stores 22 that indicates the maximum number of elements that can be produced by a kernel execution stream for storage in one of the stores 22. If the kernel produces more data than defined by the amplification factor, then the kernel may deadlock (stop executing, for example). The processor 14 can supply the amplification factor value to the PMU 18, and the PMU 18 can store the amplification factor value in the registers 44.

[0110] In some examples, in order to minimize the possibility of deadlock, the developer can include instructions in the kernel that request the value of the amplification factor. The one of the programmable computing units 28 that runs the kernel can transmit the request for the amplification factor value to the PMU 18. The PMU 18 can in turn indicate the value of the amplification factor to the programmable computing unit 28 that runs the kernel. If the programmable computing unit 28 determines that the amount of data produced by the kernel's execution streams will be greater than the amplification factor, the programmable computing unit 28 may interrupt the kernel's execution once the amount of data produced equals the amplification factor and can schedule the execution of the remaining kernel execution streams once the already produced data is consumed.
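The token scheme of paragraphs [0106]-[0108] can be sketched in OpenCL C as follows; the names are assumptions, and in hardware the PMU 18 would park a waiting stream rather than let it spin, so the loop below is only a software stand-in for that wait.

    // Sketch of token-ordered consumption; illustrative only.
    __kernel void ordered_consumer(__global const int *store22,
                                   volatile __global int *device_counter,
                                   __global int *out)
    {
        int my_token = (int)get_global_id(0);   // relative consumption order

        // Wait until the store's device atomic counter reaches our token.
        while (atomic_add(device_counter, 0) != my_token)
            ;                                   // PMU would park, not spin

        out[my_token] = store22[my_token];      // consume in order

        // The amount consumed is fixed here, so ordering advances at once;
        // with a variable amount the stream would tell the PMU 18 explicitly.
        atomic_inc(device_counter);
    }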
[0111] In addition to or instead of the above techniques for minimizing deadlock, the PMU 18 can implement advance scheduling in which the PMU 18 stores data in the stores 22 until the amount of data produced equals the amplification factor. The PMU 18 can then store the remaining data in the stores 36. In other words, the PMU 18 can ensure that requests to store data in the stores 22 remain within the "safe" range and that any request to store data beyond that range in the stores 22 is instead satisfied by storing the data in the stores 36.

[0112] Figure 3 is a flowchart showing an exemplary technique according to one or more examples described in this disclosure. As shown in Figure 3, one of the programmable computing units 28 can execute one or more execution streams of a kernel on the operations-specific graphics processor 26 of the GPU 16 (46). The PMU 18, which is inside the IC 12 or inside the GPU 16, can receive from the programmable computing unit 28 a request to store data in, or retrieve data from, the global memory 20, which is external to the IC 12, for the execution stream or streams of the kernel (48).

[0113] The PMU 18 can determine whether access is permissible for the programmable computing unit 28 that requested the storage or retrieval of the data (50). If access is not permissible (NO of 50), the programmable computing unit 28 can execute additional kernel execution streams (52). In this example, the PMU 18 can indicate to the programmable computing unit when access becomes available.

[0114] If access is available (YES of 50), the PMU 18 can determine a location within a store (one of the stores 22, for example) in global memory 20 where the data will be stored or from where the data will be retrieved (54). For example, the PMU 18 can determine the location (that is, the address) within global memory 20 where the data will be stored or from where the data will be retrieved. Based on the determined location, the GPU 16 can then store the data at, or retrieve the data from, the determined location within the one of the stores 22 within global memory 20 (56).

[0115] In some examples, in order to determine the location within the one of the stores 22, the PMU 18 can determine the location without the kernel execution stream or streams indicating the location in global memory 20 where the data will be stored or from where it will be retrieved. In this way, it is not necessary for the kernels to include instructions to determine the location within global memory 20 where the data is to be stored or from where the data is to be retrieved.

[0116] In some examples, the PMU 18 may retrieve data in addition to the requested data. In such an example, the PMU 18 can store the additional data in the cache 34. In some examples, the PMU 18 can receive state information for the stores 22 from the processor 14. In these examples, the PMU 18 can determine the location within the one of the stores 22 where the data will be stored or from where it will be retrieved based on the received state information.
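Before continuing, the Figure 3 sequence just described can be summarized in a behavioral C sketch; the real PMU 18 performs these checks in hardware inside the IC 12, and all names below are assumptions (ring_store_state has the same shape as in the earlier ring-store sketch).

    #include <stdbool.h>
    #include <stdint.h>

    /* Same shape as the earlier ring_store_state sketch. */
    typedef struct {
        uint64_t base; uint32_t capacity, head, tail, count;
    } ring_store_state;

    typedef enum { REQ_STORE, REQ_RETRIEVE } req_kind;
    typedef struct { uint64_t addr; bool granted; } pmu_reply;

    static pmu_reply pmu_handle_request(ring_store_state *s, req_kind kind,
                                        uint32_t elem_size)
    {
        pmu_reply r = { 0, false };

        /* Step 50: is access permissible (the atomic-counter check)? */
        bool ok = (kind == REQ_STORE) ? (s->count < s->capacity)
                                      : (s->count > 0);
        if (!ok)
            return r;  /* step 52: the unit runs other execution streams */

        /* Step 54: the PMU, not the kernel, determines the address. */
        uint32_t off = (kind == REQ_STORE) ? s->head : s->tail;
        r.addr = s->base + (uint64_t)off * elem_size;
        r.granted = true;
        return r;      /* step 56: IC 12 stores/retrieves at r.addr */
    }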
[0117] Figure 4 is a flowchart showing another exemplary technique according to one or more examples described in this disclosure. As shown, a first programmable computing unit (one of the programmable computing units 28, for example) of the operations-specific graphics processor 26 of the GPU 16 can execute a first execution stream (58). A second programmable computing unit (another of the programmable computing units 28, for example) of the operations-specific graphics processor 26 of the GPU 16 can execute a different, second execution stream (60).

[0118] The PMU 18, which is inside the IC 12 that includes the GPU 16, can receive from the first programmable computing unit a request to store the data produced by the execution of the first execution stream in a store (one of the stores 22, for example) in global memory 20, which is external to the IC 12 (62). In this example, the data produced by the execution of the first execution stream (a producer execution stream, for example) will be consumed by the second programmable computing unit that executes the second execution stream (a consumer execution stream, for example). In addition, the store can be a first-in-first-out (FIFO) store, of which a ring store is one example.

[0119] The PMU 18 can determine a location within the store where the data produced by the execution of the first execution stream will be stored (64). The IC 12 can store the data produced by the execution of the first execution stream at the determined location within the store (66). It is to be understood that the IC 12 storing the data produced by the execution of the first execution stream at the determined location within the store includes the IC 12 storing the data, the GPU 16 storing the data and/or the PMU 18 storing the data. In other words, the IC 12 storing the data means the IC 12 or any component of the IC 12 storing the data.

[0120] In some examples, the PMU 18 may store state information for the stores 22 within the IC 12 (within the registers 44, for example). The PMU 18 may receive such state information from the processor 14. The state information for the stores 22 may include one or more of a start address of the stores 22, an end address of the stores 22, an address within the stores 22 where the produced data will be stored, and an address within the stores 22 from where the data will be retrieved. In these examples, the PMU 18 can determine the location within the store where the data produced by the execution of the first execution stream will be stored based on the stored state information for the stores 22. Furthermore, in some examples the PMU 18 can determine the location within the store where the data produced by the execution of the first execution stream will be stored without the first execution stream indicating the location where the data is to be stored in the store.

[0121] The PMU 18 can also receive, from the second programmable computing unit that executes the second execution stream, a request to retrieve at least some of the data produced by the execution of the first execution stream. The PMU 18 can determine whether the data produced by the execution of the first execution stream is available for retrieval for consumption by the second programmable computing unit that executes the second execution stream. In some examples, the PMU 18 may receive the request from the second programmable computing unit at the same time as, before, or after receiving the request from the first programmable computing unit to store the data produced by the execution of the first execution stream.
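The availability determination of paragraph [0121], and the fallback to a third execution stream described next, can be sketched as a small C scheduling helper; all names are assumptions, and the real parking and waking is done by the PMU 18 in hardware.

    #include <stdbool.h>

    /* Illustrative only: pick which execution stream a unit runs next. */
    typedef struct {
        int  stream_id;
        bool parked;   /* waiting on data from a producer stream */
    } stream_slot;

    static int pick_next_stream(const stream_slot *slots, int n,
                                int preferred)
    {
        if (!slots[preferred].parked)
            return slots[preferred].stream_id; /* data ready: resume it  */
        for (int i = 0; i < n; ++i)            /* else run a third stream */
            if (!slots[i].parked)
                return slots[i].stream_id;
        return -1;                             /* nothing runnable now   */
    }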
[0122] When the data requested by the second execution stream is not available for retrieval for consumption by the second programmable computing unit that executes the second execution stream, the PMU 18 may indicate to the second programmable computing unit that it should execute a third execution stream. The PMU 18 may also indicate to the second programmable computing unit when the data requested by the second execution stream becomes available for retrieval for consumption by the second programmable computing unit that executes the second execution stream. The PMU 18 may then instruct the second programmable computing unit to execute the second execution stream so as to consume the data requested by the second execution stream, once that data is available for retrieval.

[0123] In some cases, the first execution stream can be a producer execution stream of a kernel and the second execution stream can be a consumer execution stream of the same kernel. In other cases, the first execution stream might be an execution stream of a producer kernel, and the second execution stream might be an execution stream of a consumer kernel.

[0124] Figure 5 is a block diagram that shows the device of Figure 1 in more detail. For example, Figure 5 shows the device 10 in more detail. Examples of the device 10 include, but are not limited to, wireless devices, mobile phones, personal digital assistants (PDAs), video game consoles that include video displays, mobile video conferencing units, laptop computers, desktop computers, television set-top boxes, tablet computing devices, e-book readers and the like. The device 10 may include the processor 14, the GPU 16, the global memory 20, a display 68, a user interface 70 and a transceiver module 72. In the example shown, the PMU 18 is formed within the GPU 16. For example, the PMU 18 may be formed within the same IC that houses the GPU 16 (that is, the IC 12). Also as shown, the GPU 16 resides inside the IC 12; the processor 14 can also reside inside the IC 12.

[0125] The device 10 may include additional modules or units not shown in Figure 5 for clarity. For example, the device 10 may include a speaker and a microphone, neither of which is shown in Figure 5, for carrying out telephone communications in examples in which the device 10 is a wireless mobile phone. In addition, the various modules and units shown in the device 10 may not be necessary in every example of the device 10. For example, the user interface 70 and the display 68 may be external to the device 10 in examples where the device 10 is a desktop computer. As another example, the user interface 70 may be part of the display 68 in examples where the display 68 is a touch-sensitive or presence-sensitive display of a mobile device.

[0126] The processor 14, the GPU 16, the PMU 18 and the global memory 20 of Figure 5 can be similar to the processor 14, the GPU 16, the PMU 18 and the global memory 20 of Figure 1. Examples of the user interface 70 include, but are not limited to, a trackball, a mouse, a keyboard and other types of input devices. The user interface 70 may also be a touch-sensitive screen and may be incorporated as a part of the display 68. The transceiver module 72 may include circuitry to allow wireless or wired communication between the device 10 and another device or a network. The transceiver module 72 can include modulators, demodulators, amplifiers and other such circuitry for wired or wireless communication. The display 68 may comprise a liquid crystal display (LCD), a cathode ray tube (CRT) display, a plasma display, a touch-sensitive display, a presence-sensitive display or another type of display device.

[0127] In one or more examples, the functions described can be implemented in hardware, software, firmware or any combination thereof.
If implemented in software, the functions can be stored on, or transmitted over, a computer-readable medium as one or more instructions or code, and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which correspond to a tangible medium such as a data storage medium, or communication media, which include any medium that facilitates the transfer of a computer program from one place to another, according to a communication protocol, for example. In this way, computer-readable media may generally correspond to (1) tangible computer-readable storage media, which are non-transient, or (2) a communication medium such as a signal or carrier wave. Data storage media can be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this disclosure. A computer program product may include a computer-readable medium.

[0128] By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures and that can be accessed by a computer. Furthermore, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL) or wireless technologies such as infrared, radio and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL or wireless technologies such as infrared, radio and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals or other transient media, but are instead directed to non-transient, tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

[0129] Instructions can be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other equivalent integrated or discrete logic circuitry. Accordingly, the term "processor" as used herein may refer to any of the foregoing structures or to any other structure suitable for implementing the techniques described herein. Furthermore, in some aspects, the functionality described here can be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec. Furthermore, the techniques could be fully implemented in one or more circuits or logic elements.

[0130] The techniques of this disclosure can be implemented in a wide variety of devices or equipment, including a wireless telephone handset, an integrated circuit (IC) or a set of ICs (a chipset, for example).
Various components, modules or units are described in this disclosure to emphasize the functional aspects of devices configured to perform the disclosed techniques, but they do not necessarily require realization by different hardware units. Instead, as described above, various units can be combined in a codec hardware unit or provided by a collection of interoperating hardware units, including one or more processors as described above, together with suitable software and/or firmware.

[0131] Various examples have been described. These and other examples are within the scope of the following claims.
Claims (11)

1. A method for performing data processing operations in a pipelined fashion, the method comprising: executing a first execution stream on a first programmable computing unit (28A) of an operations-specific graphics processor (26) of a graphics processing unit, GPU, wherein the operations-specific graphics processor includes a plurality of programmable computing units including the first programmable computing unit (28A); executing a second execution stream on a second programmable computing unit (28N) of the plurality of programmable computing units of the operations-specific graphics processor (26) of the GPU; receiving, directly with a management unit (18) within an integrated circuit, IC, (16) that includes the GPU, a request from the first programmable computing unit (28A) to store data produced by the execution of the first execution stream in a store (22A-22N) in a global memory, external to the IC (16), shared by the plurality of programmable computing units, wherein the data produced by the execution of the first execution stream will be consumed by the second programmable computing unit (28N) that executes the second execution stream, and wherein the store (22A-22N) comprises a first-in-first-out, FIFO, store (22A-22N); determining, directly with the management unit (18), a location within the store (22A-22N) where the data produced by the execution of the first execution stream will be stored; and storing, with the IC (16), the data produced by the execution of the first execution stream at the determined location within the store (22A-22N); the method characterized by further comprising: storing, with the management unit (18), state information for the store (22A-22N) within the IC (16), wherein the state information for the store (22A-22N) includes one or more of a start address of the store (22A-22N), an end address of the store (22A-22N), an address within the store (22A-22N) where the produced data will be stored, and an address within the store (22A-22N) from where the data will be retrieved; wherein determining the location within the store (22A-22N) comprises determining the location within the store (22A-22N) where the data produced by the execution of the first execution stream will be stored based on the stored state information for the store (22A-22N); and wherein the method further comprises: receiving, with the management unit (18), a request from the second programmable computing unit (28N) that executes the second execution stream to retrieve at least some of the data produced by the execution of the first execution stream; and determining, with the management unit (18), whether the data produced by the execution of the first execution stream is available for retrieval for consumption by the second programmable computing unit (28N) that executes the second execution stream.

2. The method according to claim 1, characterized in that receiving the request from the second programmable computing unit (28N) comprises receiving the request from the second programmable computing unit (28N) at the same time as, before, or after receiving the request from the first programmable computing unit (28A) to store the data produced by the execution of the first execution stream.
3. The method according to claim 1, characterized in that it further comprises: when the data requested by the second execution stream is not available for retrieval for consumption by the second programmable computing unit (28N) that executes the second execution stream, indicating, with the management unit (18), to the second programmable computing unit (28N) to execute a third execution stream; indicating, with the management unit (18), to the second programmable computing unit (28N) when the data requested by the second execution stream is available for retrieval for consumption by the second programmable computing unit (28N) that executes the second execution stream; and indicating, with the management unit (18), to the second programmable computing unit (28N) to execute the second execution stream to consume the data requested by the second execution stream when the data requested by the second execution stream is available for retrieval for consumption by the second programmable computing unit (28N) that executes the second execution stream.

4. The method according to claim 1, characterized in that it further comprises: retrieving, with the management unit (18), data from the global memory in addition to the data requested by the second execution stream; and storing, with the management unit (18), the data in addition to the data requested by the second execution stream in a cache inside the IC (16).

5. The method according to claim 1, characterized in that executing the first execution stream comprises executing a producer execution stream of a kernel, and wherein executing the second execution stream comprises executing a consumer execution stream of the same kernel.

6. The method according to claim 1, characterized in that executing the first execution stream comprises executing an execution stream of a producer kernel, and wherein executing the second execution stream comprises executing an execution stream of a consumer kernel.

7. The method according to claim 1, characterized in that the GPU includes the management unit (18), and wherein the FIFO store comprises a ring store.

8. The method according to claim 1, characterized in that determining the location within the store (22A-22N) comprises determining the location within the store (22A-22N) where the data produced by the execution of the first execution stream will be stored without the first execution stream indicating the location where the data is to be stored in the store (22A-22N).
9. An apparatus, comprising: a global memory, shared by a plurality of programmable computing units, that includes a store (22A-22N), wherein the store (22A-22N) comprises a first-in-first-out, FIFO, store; and an integrated circuit, IC, (16) comprising: a graphics processing unit, GPU, comprising: means for executing a first execution stream; means for executing a second execution stream; means for directly receiving a request from the means for executing the first execution stream to store data produced by the execution of the first execution stream in the store (22A-22N) in the global memory, wherein the data produced by the execution of the first execution stream will be consumed by the means for executing the second execution stream; means for directly determining a location within the store (22A-22N) where the data produced by the means for executing the first execution stream will be stored; and means for storing the data produced by the execution of the first execution stream at the determined location within the store (22A-22N); the apparatus characterized in that it further comprises a management unit (18) configured to store state information for the store (22A-22N) within the IC (16), wherein the state information for the store (22A-22N) includes one or more of a start address of the store (22A-22N), an end address of the store (22A-22N), an address within the store (22A-22N) where the produced data will be stored, and an address within the store (22A-22N) from where the data will be retrieved; wherein the management unit (18) is configured to determine the location within the store (22A-22N) where the data produced by the execution of the first execution stream will be stored based on the stored state information for the store (22A-22N); and wherein the management unit (18) is further configured to: receive a request from the second programmable computing unit (28N) that executes the second execution stream to retrieve at least some of the data produced by the execution of the first execution stream; and determine whether the data produced by the execution of the first execution stream is available for retrieval for consumption by the second programmable computing unit (28N) that executes the second execution stream.

10. The apparatus according to claim 9, characterized in that: the means for executing the first execution stream comprise a first programmable computing unit (28A) and the means for executing the second execution stream comprise a second programmable computing unit (28N); and the means for receiving the request from the first programmable computing unit (28A) to store data produced by the execution of the first execution stream in the store (22A-22N) in the global memory and the means for determining the location within the store (22A-22N) where the data produced by the execution of the first execution stream will be stored comprise a management unit (18).

11. A memory characterized by comprising instructions that, when executed, cause one or more processors to perform the method as defined in any one of claims 1 to 8.